Discrimination, Allocatory and Separatory, Linear Aspects


by 

Seymour  Geisser 


1.  Introduction 

The classification of objects is one of the hoariest and consequently
not the least primitive of scientific enterprises - certainly considerably
removed from more precise mechanisms directed towards the explanation and
prediction of natural and social phenomena. Briefly, it attempts to sort
out in some sensible manner objects belonging to two or more labeled
classes. When this involves a parsimonious and efficient criterion of
choice based on related manifest attributes, we are in the realm of
Discrimination.

An early recorded instance appears in the biblical book, Judges, XII,
5-6. A clan of Israelites from Gilead held the fords over the Jordan to
prevent the defeated troops of Ephraim, another Israelite tribe, from
crossing the river. The Ephraimites, sharing race, language, customs and
dress, were apparently indistinguishable in all respects from the
Gileadites. Seizing upon a dialectical variation as an efficient sorting
device, the guards made those attempting the ford pronounce the word
"Shibboleth". Upon hearing "Sibboleth" they were fairly certain of
apprehending an Ephraimite.


It is quite likely that their errors of classification were no greater
than those of many of our current weather classifiers aided by a modern
computer, who base forecasts of snow on a large number of precisely
determined variables. This is of course a situation where the label has
in fact not yet occurred, so that the problem is predictive as opposed to
the previous retrodictive case.

Often in the latter case the latent label of a new object can only
be ascertained with certainty by prodigious technical effort which may
even involve the destruction or alteration of the object, rendering it
useless for further inquiry. Other cases may require an inordinate amount
of time and patience until the label eventually reveals itself. Hence
the utilization of easily assessed related attributes may be of invaluable
aid in a study if only for reasons of economics and prudence.

There is also a natural hierarchy in terms of how these problems can
be organized. In the least informative situations, the number of classes
as well as the labels are unknown, and it is hoped that clues to both
these entities will be disclosed by some set of appropriate manifest
attributes. Here the basic problem is determining the number of classes
and forming clusters. In more informative cases the number of classes or
populations is known or specified. Further knowledge is often also presumed
concerning certain aspects of the attribute distributions.

For the sake of clarity we set down the general problem as follows:
There are populations (or patterns) $\pi_j$, $j = 1, \ldots, r$, with $r$
known or unknown and $\pi_j$ possibly specified by moments or by a
distribution function $F_j(\cdot \mid \theta_j)$, whose form may be known or
unknown, and $\theta_j$ is the $j$th set of known or unknown parameters.
There may be certain relationships among the $\pi_j$ as well as
subpopulations of the $\pi_j$. Further there are two sets of
observations, the first denoted by $X$ and the second by $U$ (either set
may be empty). Each of the observations belonging to $X$ is such that its
population origin or label is known with certitude, but the labels of
those belonging to $U$ are not. These may have some prior probabilities
attached to them before they are observed and one object of the endeavor
is to determine their origin in some optimal manner. Here allocation is
the goal. A second goal, which may be primary in certain studies, is
basically descriptive (graphical, algebraic or some other qualitative form),
and involves initially the disclosure of the manifest differential
features of the patterns, populations or potential populations under
scrutiny. The purposes of the first are action oriented, predictive or
retrodictive, while the second is more in the realm of the speculative in
terms of possibly throwing some light on scientific or social issues.

In the first case one attempts to derive some rule which optimally
allocates new observations while in the second instance one tends to focus
on functions (discriminants) which tend to maximally distinguish or separate
the populations. An appropriate allocatory procedure requires prior
probabilities of an observation belonging to one or another population or
estimates thereof. Often they are not obtainable and one tacitly assumes
that these prior probabilities are equal. In many cases this is tantamount
to using a separatory function as an allocator and the two original distinct
goals tend to fuse or become blurred. Allocatory optimality is basically
definable only when stringent assumptions are met while in vague situations
a separatory function may sometimes usefully serve as an allocator.
Conversely, allocatory notions may also be used to define a separatory
function.

Discrimination, in its modern guise, was founded by R. A. Fisher
(1936). He derived those linear functions of the class of all linear
functions that best separated populations (actually samples) in terms
of maximizing a certain distance function depending on only the first
two moments.

Since then, linear discriminants have played an important role in
the theory. From other points of view it was also found that linear
theory was preeminent in one of the most useful of distributions, the
multivariate normal, Wald (1944), Welch (1939).

In this paper we shall present not only an exposition of linear
discrimination but shall also attempt to give a coherent discussion of
its twin goals - allocation and separation.

In the next few sections we review linearity in the multivariate
normal case, discuss the extent to which linearity is optimal and
indicate the actual use of linear discriminants. This is followed by a
section in which the distributional assumptions are dropped and the
thrust is on the separation of populations via linear functions. An
incidental feature is that some of the basic results are derived
algebraically in a manner which differs from customary derivations. The
penultimate section is devoted to the application of sample reuse
procedures to linear discriminants.


2.  Multivariate  Normal  Case  — Allocation  and  Reduction 

Suppose there are $p$-dimensional multivariate populations $\pi_1, \ldots, \pi_r$
with vector means $\mu_1, \ldots, \mu_r$ and common positive definite covariance
matrix $\Sigma$. One is interested in allocating a new $p$-dimensional
observation $u$ to one of these various populations in some optimal fashion.
Assuming $u$ has prior probability $q_i$ of belonging to $\pi_i$,
$\sum_{i=1}^{r} q_i = 1$, then the optimal method for multivariate normal
populations with regard to total posterior probability of correct
classification (PCC), cf. Anderson (1958), is to allocate $u$ to that
$\pi_i$ for which

$$w_i(p) = \log q_i - \tfrac{1}{2} D_i^2(p), \qquad i = 1, \ldots, r \tag{2.1}$$

is a maximum, where

$$D_i^2(p) = (u - \mu_i)' \Sigma^{-1} (u - \mu_i), \tag{2.2}$$

the Mahalanobis distance. This is the solution which allocates $u$ to that
$\pi_i$ which has maximum posterior probability since $w_i(p)$ is easily
shown to be a monotone function of $\Pr[\pi_i \mid u]$, the posterior
probability that $u$ is from $\pi_i$.
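
The rule (2.1)-(2.2) is straightforward to compute. The following is a minimal Python sketch, not part of the original paper; the function name `allocate` and the numerical means, covariance matrix and priors are illustrative assumptions chosen only to exercise the formula.

```python
import numpy as np

def allocate(u, means, cov, priors):
    """Return (index, scores) where scores[i] = log q_i - 0.5 * D_i^2 as in (2.1)-(2.2)."""
    cov_inv = np.linalg.inv(cov)
    scores = []
    for mu, q in zip(means, priors):
        diff = u - mu
        d2 = diff @ cov_inv @ diff          # squared Mahalanobis distance D_i^2(p)
        scores.append(np.log(q) - 0.5 * d2)
    scores = np.array(scores)
    return int(np.argmax(scores)), scores

# Illustrative (assumed) inputs: three populations in three dimensions.
means = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, -1.0], [-1.0, -1.0, 1.0]])
cov = np.eye(3)
priors = np.array([1 / 3, 1 / 3, 1 / 3])

label, scores = allocate(np.array([0.9, 0.1, 0.0]), means, cov, priors)

# Posterior probabilities follow from the scores by normalization,
# which is why maximizing w_i is the same as maximizing Pr[pi_i | u].
posterior = np.exp(scores - scores.max())
posterior /= posterior.sum()
```

Working with the scores rather than the densities avoids computing the normalizing constant of the normal density, which is common to all populations when the covariance matrix is shared.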

It is sometimes of interest to determine whether we can transform linearly
the set of $p$ variables into $k \le p$ variables and preserve the allocation in
$k$ dimensions. Let $y = Cu$, which under $\pi_i$ has mean $C\mu_i$ and
covariance matrix $C \Sigma C'$, for $C$ a $k \times p$ matrix of rank
$k \le p$, and

$$D_i^2(k) = (u - \mu_i)' C' (C \Sigma C')^{-1} C (u - \mu_i), \tag{2.3}$$

the corresponding distance in $k$ dimensions.

Assume $\beta = \sum_{i=1}^{r} (\mu_i - \bar\mu)(\mu_i - \bar\mu)'$ where
$\bar\mu = r^{-1} \sum_{i=1}^{r} \mu_i$ and $\beta$ is of rank $r - v \le p$,
noting that when $\mu_1, \ldots, \mu_r$ are linearly independent $v = 1$.
Since $\beta$ is a p.s.d. matrix there exists an $A$ such that $\beta = AA'$,
where $A$ is $p \times (r - v)$. If we let $C = P'A'\Sigma^{-1}$ where
$k = r - v$ and $P$ is the $(r - v) \times (r - v)$ orthogonal matrix such that
$P'A'\Sigma^{-1}AP$ is $\mathrm{diag}(g_1, \ldots, g_{r-v})$, where the $g_j$
are the non-zero roots of $\beta\Sigma^{-1}$ in descending order, then

$$D_i^2(k) = (u - \mu_i)' \Sigma^{-1} A [A' \Sigma^{-1} A]^{-1} A' \Sigma^{-1} (u - \mu_i). \tag{2.4}$$

Further, by adding and subtracting $\bar\mu$ in $u - \mu_i$ and noting that
$\mu_i - \bar\mu$ is in the vector space generated by $AA'$, it is easily
shown that for all $i$

$$w_i(p) - w_i(k) = -\tfrac{1}{2}\left[D_i^2(p) - D_i^2(k)\right] = -\tfrac{1}{2}\,(u - \bar\mu)' \left[\Sigma^{-1} - \Sigma^{-1} A [A' \Sigma^{-1} A]^{-1} A' \Sigma^{-1}\right] (u - \bar\mu) \tag{2.5}$$

and hence is independent of $i$. Therefore allocation of $y$ by means of
the maximum $w_i(k)$ is equivalent to the original allocation of $u$, thus
verifying that $C = P'A'\Sigma^{-1}$ is a solution that preserves the original
allocation. The new set of coordinates $y$ are referred to as the complete set
of linear multiple discriminants and they contain all of the discriminatory
power of the original set of coordinates. The set $y$ is an orthogonal set
and forms a basis for all other solutions $y^* = Ry$ where $R$ is any real
nonsingular $k \times k$ matrix. On the other hand if $k < r - v$, the
allocation by $y$, the transform of $u$, will not be the same as the
allocation by $u$ for all $u$, as can easily be verified.
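
The claim that the reduced coordinates preserve the allocation can be checked numerically. The sketch below is an illustration added here, not the author's code: the dimensions, random means, covariance matrix and priors are all assumed values, and the variable `beta` stands for the between-population matrix formed from the outer products of the centered means.

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 5, 3                                   # assumed dimensions
means = rng.normal(size=(r, p))
L = rng.normal(size=(p, p))
cov = L @ L.T + p * np.eye(p)                 # a positive definite Sigma
cov_inv = np.linalg.inv(cov)
priors = np.array([0.2, 0.3, 0.5])

# Between-population matrix: sum of outer products of centered means.
centered = means - means.mean(axis=0)
beta = centered.T @ centered

# Factor beta = A A' with A of full column rank (p x (r - v)).
vals, vecs = np.linalg.eigh(beta)
keep = vals > 1e-10 * vals.max()
A = vecs[:, keep] * np.sqrt(vals[keep])

# C = P' A' Sigma^{-1}, with P orthogonal and diagonalizing A' Sigma^{-1} A.
g, P = np.linalg.eigh(A.T @ cov_inv @ A)
order = np.argsort(g)[::-1]                   # roots in descending order
g, P = g[order], P[:, order]
C = P.T @ A.T @ cov_inv

def scores(x, mus, S):
    """w_i = log q_i - 0.5 * (x - mu_i)' S^{-1} (x - mu_i) for every population."""
    d = x - mus
    return np.log(priors) - 0.5 * np.einsum("ij,jk,ik->i", d, np.linalg.inv(S), d)

U = 2.0 * rng.normal(size=(200, p))           # test points to allocate
full = np.array([np.argmax(scores(u, means, cov)) for u in U])
reduced = np.array(
    [np.argmax(scores(C @ u, means @ C.T, C @ cov @ C.T)) for u in U]
)
same_allocation = bool(np.all(full == reduced))
```

Here `A` is obtained from the spectral decomposition of `beta`, which is one convenient choice; any factor with `A @ A.T` equal to `beta` would serve.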

The total probability that $u$ will be correctly allocated by the procedure
is

$$q(p) = \sum_{i=1}^{r} q_i \Pr[u \in R_i \mid \pi_i] \tag{2.6}$$

where $R_i$ is the region given by those $u$ satisfying
$\max_j w_j(p) = w_i(p)$, and is a maximum with respect to all possible
procedures. If $y = P'A'\Sigma^{-1}u$ then it is clear that

$$q(p) = q(r - v) = \sum_{i=1}^{r} q_i \Pr[y \in R_i^* \mid \pi_i] \tag{2.7}$$

where $R_i^*$ is given by those $y$ satisfying
$\max_j w_j(r - v) = w_i(r - v)$. To simplify matters let $q_i = r^{-1}$
for all $i$ so that

$$q(p) = q(r - v) = r^{-1} \sum_{i=1}^{r} \Pr[y \in R_i^* \mid \pi_i]; \tag{2.8}$$

then $R_i$ and $R_i^*$ are given by $u$ and $y$ which minimize $D_i^2(p)$ and
$D_i^2(r - v)$ respectively. When $k < r - v$ and we use the procedure, i.e.,
minimizing $D_i^2(k)$ of (2.3) where $y = Cu$, then $q(k) < q(r - v)$ by
continuity arguments. On the other hand it might be conjectured that the best
one can do with respect to maximizing $q(k)$ is to let
$C = P_{(k)}'A'\Sigma^{-1}$ where $P_{(k)} = (P_1, \ldots, P_k)$ is the matrix
of the first $k$ columns of $P$, i.e., $P_i$ is the invariant vector
associated with the $i$th largest root of $A'\Sigma^{-1}A$, or equivalently
$AP_i$ is the invariant vector associated with the identical root of
$\beta\Sigma^{-1}$. This conjecture is in general false whenever
$r - v \ge 2$ if we wish to maximize $q(k)$, as a counterexample will show.
But from another point of view, i.e., optimizing on separatory criteria,
which we shall discuss in Section 5, it can be best.

A further note of caution should be introduced to the effect that the PCC
is only of value in assessing the discriminatory power of the manifest
variables at hand prior to the observation of $u$. Once a set of such
variables is determined and a particular $u$ observed, the only relevant
factor is the posterior probability, when calculable, that $u$ belongs to one
or another of $\pi_1, \ldots, \pi_r$,

$$\Pr[\pi_j \mid u] = q_j f_j(u) \Big/ \sum_{i=1}^{r} q_i f_i(u) \tag{2.9}$$

where $f_j(\cdot)$ represents in general the probability function associated
with $\pi_j$.

We shall now describe the aforementioned counterexample. Suppose we ask
for a single linear combination that will maximize the PCC assuming the $r$
populations all have equal prior probability $r^{-1}$, blurring the distinction
between allocation and separation. Then $c'u$, under $\pi_j$, is univariate
normal with mean $c'\mu_j$ and variance $c'\Sigma c$. Then
$z = c'u / \sqrt{c'\Sigma c}$ is, under $\pi_j$, $N(\eta_j, 1)$ where
$\eta_j = c'\mu_j / \sqrt{c'\Sigma c}$. Hence we can calculate the
maximal probability of correct classification for any $c$,

$$\mathrm{PCC} = 2r^{-1} \sum_{i=1}^{r-1} \Phi\!\left(\tfrac{1}{2}\left(\eta_{(i)} - \eta_{(i+1)}\right)\right) + (2 - r)\, r^{-1} \tag{2.10}$$

where $\Phi$ is the distribution function of a standardized normal variate and
the $\eta_{(i)}$ are the ordered values of the $\eta_j$ such that
$\eta_{(1)} \ge \eta_{(2)} \ge \cdots \ge \eta_{(r)}$.
Maximization of the PCC with respect to $c$ is troublesome, but it can be
shown that the $c$ that maximizes PCC is not necessarily the vector
associated with the largest root of $\beta\Sigma^{-1}$ as one might initially
suspect. Such a suspicion of course would arise from the fact that this vector
does maximize the variation amongst the $\eta_j$. While this variation is
contributory, the PCC is also quite sensitive to the spacing amongst the
$\eta_j$.

The following example demonstrates these facts: Let $u$ be a $3 \times 1$
vector with means under $\pi_1$, $\pi_2$ and $\pi_3$, respectively,
$\mu_1' = (1, 0, 0)$, $\mu_2' = (0, 1, -1)$, $\mu_3' = (-1, -1, 1)$ and
$\Sigma = I$. Then $\beta\Sigma^{-1} = \beta$ and

$$\beta = \begin{pmatrix} 2 & 1 & -1 \\ 1 & 2 & -2 \\ -1 & -2 & 2 \end{pmatrix}$$

with characteristic roots $3 + \sqrt{3}$, $0$, $3 - \sqrt{3}$. The normed
vector associated with the largest root is
$c_1' = (6 - 2\sqrt{3})^{-1/2} (\sqrt{3} - 1,\ 1,\ -1)$. Using $c_1'u$ we
find that
$(\eta_{(1)}, \eta_{(2)}, \eta_{(3)}) = (6 - 2\sqrt{3})^{-1/2} (2,\ \sqrt{3} - 1,\ -1 - \sqrt{3})$
and compute the PCC to be .67757. On the other hand the simple normed vector
$c_2' = (0,\ 1/\sqrt{2},\ -1/\sqrt{2})$ yields
$(\eta_{(1)}, \eta_{(2)}, \eta_{(3)}) = (\sqrt{2},\ 0,\ -\sqrt{2})$, equally
spaced, and results in a PCC of .68033, which is just a trifle larger than
that attained by the vector associated with the maximum root of $\beta$. In
actual fact the normed vector $c_3' = (.173,\ .697,\ -.697)$ leads to
$(\eta_{(1)}, \eta_{(2)}, \eta_{(3)}) = (1.394,\ .173,\ -1.567)$ and yields a
PCC of .69139, which is the maximum attainable here for a single linear
combination. It is well known that the dispersion,
$\sum_{i=1}^{r} (\eta_i - \bar\eta)^2$, attains its maximum,
$3 + \sqrt{3} = 4.732$ in this case, when the vector associated with the
largest root is utilized, while the same measure of dispersion for the vector
$c_2$ is 4 - considerably less, and for the vector $c_3$ we obtain 4.42.
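
The figures in this example can be recomputed directly from the equal-prior PCC formula for a single discriminant (twice the average of the normal distribution function evaluated at half the successive gaps between ordered means, less $(r-2)/r$). The Python sketch below is an added illustration; the function names `pcc` and `dispersion` are assumptions of this sketch, and the third vector is entered exactly as rounded in the text.

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Population means from the example; Sigma = I, so eta_j = c' mu_j for a normed c.
means = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, -1.0], [-1.0, -1.0, 1.0]])
beta = means.T @ means   # the grand mean is zero, so no centering is needed

def pcc(c):
    """Equal-prior PCC of the single discriminant c'u, per (2.10)."""
    eta = np.sort(means @ np.asarray(c, dtype=float))[::-1]   # descending order
    r = len(eta)
    gaps = eta[:-1] - eta[1:]
    return (2.0 / r) * sum(norm_cdf(0.5 * g) for g in gaps) + (2.0 - r) / r

def dispersion(c):
    """Sum of squared deviations of the eta_j about their mean."""
    eta = means @ np.asarray(c, dtype=float)
    return float(np.sum((eta - eta.mean()) ** 2))

roots, vectors = np.linalg.eigh(beta)          # roots in ascending order
c1 = vectors[:, -1]                            # vector for the largest root
c2 = np.array([0.0, 1.0, -1.0]) / sqrt(2.0)
c3 = np.array([0.173, 0.697, -0.697])          # as rounded in the text

for name, c in [("c1", c1), ("c2", c2), ("c3", c3)]:
    print(name, round(pcc(c), 5), round(dispersion(c), 3))
```

The sign of an eigenvector returned by `eigh` is arbitrary, but both functions depend only on the gaps and spread of the projected means, so the results are unaffected.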

Another way of viewing this problem is to realize that we are
basically maximizing two quite different functions of the ordered values of
$\eta_j$, $j = 1, \ldots, r$, with respect to the arbitrary vector $c$. One
function is given by (2.10) while the other is

$$\sum_{j=1}^{r} \left(\eta_{(j)} - \bar\eta\right)^2. \tag{2.11}$$

That the characteristic vector associated with the largest root maximizes (2.11)
results from the fact that (2.11) is invariant with regard to the
ordering of the $\eta_j$ so that from the definition of $\eta_j$ we obtain

$$\sum_{j=1}^{r} \left(\eta_{(j)} - \bar\eta\right)^2 = \sum_{j=1}^{r} \left(\eta_j - \bar\eta\right)^2 = \frac{c'\beta c}{c'\Sigma c}. \tag{2.12}$$

As is well known, the quantity on the right of (2.12) is maximized when $c$
is set equal to the characteristic vector associated with the largest root of
$\beta\Sigma^{-1}$. Hence there is really no reason to expect the same
solution for both cases.

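We note that when $r = 2$, the optimal allocatory procedure yields the
single linear discriminant

$$U^* = \left(u - \tfrac{1}{2}(\mu_1 + \mu_2)\right)' \Sigma^{-1} (\mu_1 - \mu_2) + \log\frac{q_1}{q_2} \tag{2.13}$$

such that $U^* > 0$ assigns $u$ to $\pi_1$ and $U^* < 0$ assigns $u$ to $\pi_2$.
Insertion of the usual estimates for $\mu_1$, $\mu_2$ and $\Sigma$ when they
are unknown and estimable from data yields the plug-in rule

$$V^* = \left(u - \tfrac{1}{2}(\hat\mu_1 + \hat\mu_2)\right)' \hat\Sigma^{-1} (\hat\mu_1 - \hat\mu_2) + \log\frac{q_1}{q_2} \tag{2.14}$$

with $V^* > 0$ assigning $u$ to $\pi_1$ and to $\pi_2$ otherwise.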
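
As a concrete illustration of the plug-in rule, the sketch below computes the two-population linear discriminant from labeled samples. It is an added example rather than the paper's own code: it assumes the "usual estimates" are the sample means and the pooled unbiased covariance matrix, and the function name and simulated data are hypothetical.

```python
import numpy as np

def plug_in_discriminant(u, X1, X2, q1=0.5, q2=0.5):
    """Plug-in linear discriminant: positive values favor population 1."""
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled unbiased covariance estimate with n1 + n2 - 2 degrees of freedom
    # (an assumption about what "the usual estimates" means).
    S = ((X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)) / (n1 + n2 - 2)
    w = np.linalg.solve(S, m1 - m2)            # S^{-1} (m1 - m2)
    return float((u - 0.5 * (m1 + m2)) @ w + np.log(q1 / q2))

# Simulated (assumed) training data from two normal populations.
rng = np.random.default_rng(1)
X1 = rng.normal(loc=[1.0, 0.0], size=(30, 2))
X2 = rng.normal(loc=[-1.0, 0.0], size=(40, 2))

v_at_mean1 = plug_in_discriminant(X1.mean(axis=0), X1, X2)
v_at_mean2 = plug_in_discriminant(X2.mean(axis=0), X1, X2)
```

With equal priors the discriminant is antisymmetric about the midpoint of the two sample means, so the two values computed above differ only in sign.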

It  will  be  shown,  however,  that  from  certain  points  of  view,  even 
in  this  most  structured  of  cases,  linear  theory,  strictly  speaking,  may  be 
inappropriate  though  approximately  correct  and  certainly  convenient. 

3. The Limits of Linear Theory - Allocatory Aspects

For the remainder of our discussion we shall restrict ourselves to
the two population case: $\pi_1$ and $\pi_2$ with density functions
$f(\cdot \mid \theta_1)$ and $f(\cdot \mid \theta_2)$ respectively. Optimal
allocation for a new observation $u$ with regard to the PCC involves assigning

$$u \text{ to } \pi_1 \text{ if } \rho(\theta, u) = \frac{q_1 f(u \mid \theta_1)}{q_2 f(u \mid \theta_2)} > 1, \qquad u \text{ to } \pi_2 \text{ otherwise,} \tag{3.1}$$

where $\theta = \theta_1 \cup \theta_2$ is the entire set of distinct
parameters of the problem. Equivalently any monotonic increasing function of
$\rho$, say $h(\rho)$, for every fixed $u$ will also do, so that any $h(\rho)$
may be denoted as an allocatory population discriminant. "Linear" theory is
then surely optimal whenever there exists an $h(\rho)$ which is linear in $u$,
although there are other cases as well. The multivariate normal distribution
with equal covariance matrices is an example of the logistic class which
always yields a linear population discriminant because of the form of

$$\rho(\theta, u) = e^{\alpha_0 + \alpha' u} \tag{3.2}$$

where $\alpha' = (\alpha_1, \ldots, \alpha_p)$, and consequently
$\log \rho(\theta, u)$ is linear in $u$.
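
For the equal-covariance normal case the coefficients of the logistic form can be written out explicitly. The short derivation below is added for illustration; it follows from expanding the two quadratic forms in the log density ratio, and the symbols $\alpha_0$ and $\alpha$ are used as in the logistic form above.

```latex
\begin{aligned}
\log \rho(\theta, u)
  &= \log\frac{q_1}{q_2}
     - \tfrac{1}{2}(u - \mu_1)'\Sigma^{-1}(u - \mu_1)
     + \tfrac{1}{2}(u - \mu_2)'\Sigma^{-1}(u - \mu_2) \\
  &= \underbrace{\log\frac{q_1}{q_2}
     - \tfrac{1}{2}(\mu_1 + \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)}_{\alpha_0}
     + \underbrace{(\mu_1 - \mu_2)'\Sigma^{-1}}_{\alpha'}\, u .
\end{aligned}
```

The quadratic terms $u'\Sigma^{-1}u$ cancel only because the covariance matrices are equal, which is why unequal covariances lead to a quadratic rather than a linear discriminant.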

J. A. Anderson [1973] points out that multivariate independent
dichotomous variables as well as several other interesting cases also belong
to the logistic class. In fact this type of linearity remains valid for a
special case of the general exponential family where $\theta_i$ is the set of
parameters $(\beta_i, \tau)$ and

$$f(u \mid \theta_i) = g(\beta_i, \tau)\, h(u, \tau)\, e^{\beta_i' u} \tag{3.3}$$

where $\beta_i$ is a $p$-dimensional vector and $\tau$ is a set of extraneous
parameters. But there are also other possibilities for linearity, e.g.,
two multivariate "student" distributions that differ in their location
but have the same covariance matrices and equal prior probabilities, Geisser
[1966]. Here the rule (3.1) is equivalent to a rule linear in $u$ derived
from a positive root of $\rho(\theta, u)$. The rules conform exactly even
though the positive root of $\rho(\theta, u)$ is nonlinear. For a slightly
wider class of which the above is a special case see Enis and Geisser [1974].
Exact linear theory is then only strictly appropriate for restricted sets of
distributional assumptions though somewhat wider than the logistic family.
However, it is generally hoped that it will give reasonably robust, if less
than optimal, solutions to many other cases. There are situations, however,
where it certainly should not be applied, e.g., where two normal populations
have the same mean but differ in their covariance matrices. Here linear
discriminants will be quite inappropriate. This model reflects to a degree
the situation arising in discriminating between fraternal and nonfraternal
twins, see e.g., Richter and Geisser [1960], Okamoto [1961], Geisser and
Desu [1968], Desu and Geisser [1973], Geisser [1973a].

However, except for special situations as just described, it is usually
assumed or piously hoped that linearity will be at least a not unreasonable
first approximation. By this is implied that the rule (3.1) can be replaced
by a rule linear in $u$ without great loss. For a contrary view in taxonomy
see Reyment [1973]. In the classical frequential paradigm often an estimate
of $\alpha$ is plugged into (3.1) while $\alpha_0$ is resolved into its
constituent sum $\log q_1 q_2^{-1} + \alpha_0^*$, with an estimate for
$\alpha_0^*$ plugged in and $\log q_1 q_2^{-1}$ assumed to be a particular
value, often 0 for convenience, resulting in a discriminatory blur. Sometimes
$q_1$ is derived from a model, Geisser [1973a], or estimated from previous
data or from the data at hand, when the situation permits, Geisser [1964].
Now when $\alpha_0$ and $\alpha$ are known any $h(\rho)$, of course,
will do as well. However, depending on which $h(\rho)$ and what is used for
its estimation when the parameter values are only estimable from data, the
sample discriminant or rule for allocation will in general vary. One way
around this is to use maximum likelihood or any other estimator which will
preserve the invariance of the rule. For a discussion of some of these and
related points see Geisser [1969, 1970] and Desu and Geisser [1973]. To do
otherwise requires that the statistician decide on whether the rule is
paramount or the estimation of a particular discriminatory function
$h(\rho)$ is crucial. Of course for large samples the discrepancy may be
quite negligible.

However, as was noted, the logistic model itself encompasses a variety
of possible distributional assumptions. While presumably robust for its class
when its parameters are estimated it is not expected to yield as efficient a
procedure when compared to one that is based on the true member of the class.
For a logistic and normal comparison see Efron [1975].

Another classical approach, Wald [1944], Anderson [1958, 141-2], is via
the testing of hypotheses. Here one computes the likelihood ratio test of
the hypothesis that the new observation belongs to either of the two
populations under scrutiny. More specifically if $X_i$ is the set of
observations known to be from $\pi_i$, then

$$\lambda = \frac{\max_\theta\, f(X_1 \mid \theta_1)\, f(X_2 \mid \theta_2)\, f(u \mid \theta_1)}{\max_\theta\, f(X_1 \mid \theta_1)\, f(X_2 \mid \theta_2)\, f(u \mid \theta_2)} \tag{3.4}$$

and $u$ is assigned to $\pi_1$ if $\lambda > q_2 / q_1$, to $\pi_2$
otherwise. Hence $q_1 q_2^{-1} \lambda$ may be termed the likelihood ratio
allocatory discriminant.


For the multivariate normal case with equal covariance matrices,

$$\lambda = \left[\frac{1 + N_2\, \nu^{-1} (N_2 + 1)^{-1} (u - \bar x_2)' S^{-1} (u - \bar x_2)}{1 + N_1\, \nu^{-1} (N_1 + 1)^{-1} (u - \bar x_1)' S^{-1} (u - \bar x_1)}\right]^{(\nu + 3)/2} \tag{3.5}$$

where $\bar x_i$ is the sample mean of $N_i$ independent observations
represented by $X_i$ and known to have originated from $\pi_i$, and $S$ is
the usual unbiased estimate of $\Sigma$ with $\nu = N_1 + N_2 - 2$ degrees of
freedom, $\nu > p$.
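
The closed form for the normal-theory likelihood ratio can be evaluated in a few lines. The following Python sketch is an added illustration under the assumption that the statistic is the ratio of the two bracketed quadratic terms raised to the power $(\nu + 3)/2$ with $\nu = N_1 + N_2 - 2$; the function name and simulated data are hypothetical.

```python
import numpy as np

def likelihood_ratio_discriminant(u, X1, X2):
    """Likelihood ratio for 'u from population 1' against 'u from population 2'."""
    n1, n2 = len(X1), len(X2)
    nu = n1 + n2 - 2
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled unbiased covariance estimate S with nu degrees of freedom.
    S = ((X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)) / nu
    d1 = (u - m1) @ np.linalg.solve(S, u - m1)
    d2 = (u - m2) @ np.linalg.solve(S, u - m2)
    numerator = 1.0 + n2 * d2 / (nu * (n2 + 1))
    denominator = 1.0 + n1 * d1 / (nu * (n1 + 1))
    return float((numerator / denominator) ** ((nu + 3) / 2))

# Simulated (assumed) data with unequal sample sizes.
rng = np.random.default_rng(2)
X1 = rng.normal(loc=[1.0, 0.0, 0.0], size=(12, 3))
X2 = rng.normal(loc=[-1.0, 0.0, 0.0], size=(20, 3))
u = np.array([0.3, -0.2, 0.5])

lam = likelihood_ratio_discriminant(u, X1, X2)
assign_to_first = lam > 1.0     # equal priors: compare with q2 / q1 = 1
```

Because the numerator and denominator carry different weights when the sample sizes differ, the boundary defined by a fixed threshold on this statistic is quadratic in `u` in general.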

It is interesting to note that it is no longer necessarily possible to
recover a linear discriminant from this procedure except under the rather
restrictive assumption that
$q_1^{2/(\nu+3)} N_2 (N_1 + 1) = q_2^{2/(\nu+3)} N_1 (N_2 + 1)$. Of
course satisfaction is guaranteed if both $q_1 = q_2$ and $N_1 = N_2$. Although
this may be disconcerting, it is not surprising as the thrust here is
essentially on a rule (or test) rather than on the estimation of a true
underlying linear population discriminant. Although the likelihood ratio
discriminant for this paradigm is equivalent to a rule based on a quadratic
discriminant it approaches linearity for large $N_1$ and $N_2$ so that for
large enough samples there will be virtually little difference between it and
the "usual" plug-in estimate (rule)

$$V^* = \left[u - \tfrac{1}{2}(\bar x_1 + \bar x_2)\right]' S^{-1} (\bar x_1 - \bar x_2) + \log\frac{q_1}{q_2} \tag{3.6}$$

for the true population discriminant (rule)

$$U^* = \left[u - \tfrac{1}{2}(\mu_1 + \mu_2)\right]' \Sigma^{-1} (\mu_1 - \mu_2) + \log\frac{q_1}{q_2}. \tag{3.7}$$

The rule indicated by (3.5) was shown to be an admissible Bayes rule
by Kiefer and Schwartz [1965] and also Das Gupta [1965] for this allocation
problem. However, the proper prior distribution which is utilized to prove
the admissibility is one that most Bayesians would consider grossly deficient
in that it depends on the sum of the sample sizes and only assigns non-zero
density to functions of $\Sigma$, $\mu_1$, $\mu_2$ in a space restricted to
$p$ dimensions whereas the set $(\Sigma, \mu_1, \mu_2)$ ostensibly contains
$(1/2)p(p+5)$ parameters. This does not say too much for the Bayes admissible
character of the rule. Whether a proper prior can be obtained which does not
have these drawbacks is an open question, but it is not likely.

Another Bayesian derivation given by Geisser [1964] uses the simple
improper prior density

$$g(\mu_1, \mu_2, \Sigma^{-1}) \propto |\Sigma|^{(p+1)/2} \tag{3.8}$$

and also results in a quadratic rule in general. Hence $V^*$ is not recoverable
for arbitrary values of $N_1$, $N_2$, $p$, and $q_1$, Geisser [1966], but it is
recoverable except for an additive constant depending on a particular
relationship existing among these values, Enis and Geisser [1974]. It is only
fully recoverable for the special case $N_1 = N_2$ and $q_1 = q_2$. Hence on a
strictly allocatory basis the linear discriminant $V^*$ has not been found to
be admissible.

A semi-Bayesian justification, Geisser [1964], based on the
aforementioned improper prior, focuses on the Bayesian estimation of $U^*$,
rather than on allocation. This approach yields for the posterior expectation
of $U^*$

$$E(U^* \mid u) = V^* + \tfrac{p}{2}\left(N_2^{-1} - N_1^{-1}\right) \tag{3.9}$$

which all but recovers the linear rule (3.6) and completely so whenever
$N_1 = N_2$. Elaborations of the use of this method are presented by Enis and
Geisser [1970]. Another Bayesian approach, Enis and Geisser [1974], which
stresses linearity also yields results close to the rule $V^*$. Here one
determines that linear function which maximizes the PCC with respect to the
predictive distribution of the observation to be classified. Here an
allocatory notion is utilized as the separatory criterion with discriminants
restricted to a linear class. This attempts to "optimally" compromise the
allocatory needs with the desirability of linearity.

In both normal and non-normal applications, $V$ is often utilized
although it is not clear whether this emanates from the fact that the
normal population discriminant $U^*$ is linear and both allocatory and
separatory and $V^*$ is a good estimate of $U^*$, or that $V^*$ for
$q_1 = q_2$ can be derived as the "best" separatory linear discriminant in a
distribution free setting utilizing the sample, Fisher [1936]. Basically it
appears that for many less sophisticated users of the technique it is both the
simplicity of linearity combined with the authority of Fisher that is
compelling. At any rate, there seems to be a bias in applications (as well as
theory) for focusing on linear discriminants rather than a quest for overall
optimal allocation irrespective of the goal. One has only to peruse the
discriminatory literature to observe that almost all applications are linear
and much theory is devoted to the "improvement" of linear estimates of linear
discriminants.

We also note that even for the particular normal distribution setup
discussed here there has as yet not been any completely frequentist rule that
guarantees optimal allocation when the parameters are unknown nor a
Bayesian rule which yields $V^*$ for all values of $q_1$, $q_2$, $N_1$ and
$N_2$. On the other hand, when allocation is actually not the goal, linearity
may be inherently more useful (certainly descriptively) because of its
simplicity in discussing certain issues, and in the normal case both
frequentist and Bayesian estimation procedures will yield linear sample
discriminants.


4. Using Linear Discriminants - Normal Case


As in the previous section let $X_i$, $i = 1, 2$, represent a set of
observations known to be from $\pi_i$, a $N(\mu_i, \Sigma)$ population. The
object is to optimally allocate a new observation $u$ which has prior
probability $q_i$ of being from $\pi_i$. We then assign a prior probability
$g(\mu_1, \mu_2, \Sigma)$ to the unknown set of parameters. Hence

$$R = \frac{\Pr[\pi_1 \mid u]}{\Pr[\pi_2 \mid u]} = \frac{q_1 f(u \mid X, \pi_1)}{q_2 f(u \mid X, \pi_2)} \tag{4.1}$$

where $X = (X_1, X_2)$ and

$$f(u \mid X, \pi_i) = \int f(u \mid \mu_i, \Sigma)\, p(\mu_1, \mu_2, \Sigma \mid X)\, d\mu_1\, d\mu_2\, d\Sigma, \tag{4.2}$$

the predictive density of a future observation, where

$$p(\mu_1, \mu_2, \Sigma \mid X) \propto f(X \mid \mu_1, \mu_2, \Sigma)\, g(\mu_1, \mu_2, \Sigma). \tag{4.3}$$

This then provides the solution for the allocation of the next
observation and can be used on all further observations. This
latter use is not optimal as the predictive distribution of a set of
new observations is dependent and here it would be utilized as if
they were independent (for the optimal solution see Geisser (1966)).

At any rate the solution is optimal for the next observation $u$.
However one is in a quandary as to how to calculate a joint prior
distribution for $\mu_1$, $\mu_2$ and $\Sigma$ that realistically reflects
prior knowledge one may have about them. One way out of this dilemma is to
use the improper prior $g(\mu_1, \mu_2, \Sigma^{-1}) \propto |\Sigma|^{(p+1)/2}$
which tends to minimize the effect of the prior distribution. The results for
this case were given by Geisser (1964) and yield for the posterior probability
ratio

$$R = \frac{q_1}{q_2} \left(\frac{N_1}{N_2}\right)^{p/2} \left(\frac{N_1 + 1}{N_2 + 1}\right)^{(N_1 + N_2 - 1 - p)/2} \left[\frac{(N_2 + 1)(N_1 + N_2 - 2) + N_2 (u - \bar x_2)' S^{-1} (u - \bar x_2)}{(N_1 + 1)(N_1 + N_2 - 2) + N_1 (u - \bar x_1)' S^{-1} (u - \bar x_1)}\right]^{(N_1 + N_2 - 1)/2} \tag{4.4}$$

so that when $R > 1$ assign $u$ to $\pi_1$ and to $\pi_2$ otherwise. This
rule is in general quadratic and is linear only for very special cases
among which is $N_1 = N_2$ and $q_1 = q_2$, but tends to linearity as the
sample sizes increase.
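
A hedged numerical sketch of the predictive rule follows. It assumes, as our reading of the result cited from Geisser (1964), that under the improper prior each predictive density is of multivariate Student form, proportional to $(N_i/(N_i+1))^{p/2}\,[1 + N_i (u - \bar x_i)' S^{-1} (u - \bar x_i) / ((N_i+1)\nu)]^{-(\nu+1)/2}$ with $\nu = N_1 + N_2 - 2$. The function name and the simulated data are hypothetical.

```python
import numpy as np

def predictive_log_odds(u, X1, X2, q1=0.5, q2=0.5):
    """log R, the log posterior odds of population 1 against population 2."""
    n1, n2 = len(X1), len(X2)
    p = X1.shape[1]
    nu = n1 + n2 - 2
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S = ((X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)) / nu
    d1 = (u - m1) @ np.linalg.solve(S, u - m1)
    d2 = (u - m2) @ np.linalg.solve(S, u - m2)

    def log_density(n, d):
        # Log predictive density up to a constant shared by both populations
        # (assumed Student form; see the lead-in).
        return 0.5 * p * np.log(n / (n + 1)) - 0.5 * (nu + 1) * np.log1p(
            n * d / ((n + 1) * nu)
        )

    return float(np.log(q1 / q2) + log_density(n1, d1) - log_density(n2, d2))

# Simulated (assumed) data with equal sample sizes.
rng = np.random.default_rng(3)
X1 = rng.normal(loc=[1.0, 0.0], size=(15, 2))
X2 = rng.normal(loc=[-1.0, 0.0], size=(15, 2))
u = np.array([0.4, 0.1])

log_R = predictive_log_odds(u, X1, X2)
assign_to_first = log_R > 0.0
```

With equal sample sizes and equal priors the sign of `log_R` depends only on which sample mean is closer to `u` in the metric of `S`, which is the special case in which the rule is linear.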

All evidence to date indicates that this procedure is superior for
allocation to the plug-in rule $V^*$ of (3.6). In this regard admissibility
was previously discussed. From the point of view of density
estimation the predictive density, as generally suggested in Geisser
(1971), is shown by Aitchison (1975) to be a better estimate of the true
density in this case than what results from plugging in the maximum
likelihood estimates into the known normal density (which is basically
the rule $V^*$) by a "frequentist" goodness of fit criterion based on the
Kullback-Leibler (1951) directed measure of divergence.

On the other hand the use of $V$ ($V^*$ with $q_1 = q_2$) as a separatory
function can be made compelling or approximately so even when based on a
probabilistic criterion of the kind discussed in (2.10), Enis and
Geisser (1974). Neither in its form nor its interpretation is (4.4)
very appealing for separatory purposes, while using $V$ as a separatory
function seems to be very attractive for many applications.
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If  one  then  were  satisfied  with  V as  a separatory  function  and  decided 
to  use  V*  as  well  in  the  allocatory  mode  as  in  most  applications,  what 
can  we  say  about  its  properties,  i.e.,  how  good  an  allocator  is  it.  Be- 
fore answering  this  let  us  examine  the  allocatory  prowess  of  U*  when 
the  parameters  are  known--the  best  possible  situation.  Then,  letting  0 

stand  for  u,  , ijl  and  Zj, 

JL  d 

pcc  = y(q)  = q^Ce)  + q2v(e) 

where 

VjCe)  « Mu*  > o|.tx,  e]  j 

Y2(0)  = Pr[U*  < 0|TT2,  03,1 

7^(9)  being  the  probability  of  U*  correctly  classifying  an  observa- 
tion emanating  from  TT^ . It  is  easily  shown,  Geisser  (1967),  that 

Yx(e)  - 1 - *(tl) 

Yg(0)  = «(t2) 

_1  ^ _1  4 

where  ^ » (log  - &*)/«*• , t?  = (log  q^"  + &*)/or 

, -1 

and  = (Uj  - Jig)  £ (p,x  - U2). 
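As a small numerical illustration of the known-parameter case, the sketch below evaluates the probability of correct classification from the prior probability $q_1$ and the distance $\alpha$, assuming the expressions $\tau_1 = (\log q_2q_1^{-1} - \alpha^2/2)/\alpha$ and $\tau_2 = (\log q_2q_1^{-1} + \alpha^2/2)/\alpha$. The function names are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def optimal_pcc(q1, alpha):
    """PCC = q1 * (1 - Phi(tau1)) + q2 * Phi(tau2) for known parameters."""
    q2 = 1.0 - q1
    log_ratio = math.log(q2 / q1)
    tau1 = (log_ratio - 0.5 * alpha ** 2) / alpha
    tau2 = (log_ratio + 0.5 * alpha ** 2) / alpha
    gamma1 = 1.0 - normal_cdf(tau1)  # correct classification of pi_1
    gamma2 = normal_cdf(tau2)        # correct classification of pi_2
    return q1 * gamma1 + q2 * gamma2
```

For equal priors the value reduces to $\Phi(\alpha/2)$, so that, for example, $\alpha = 2$ gives roughly 0.84.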

Hence a "plug-in" estimate of the best one can do is $\gamma(\hat\theta) = q_1(1 - \Phi(\hat\tau_1)) + q_2\Phi(\hat\tau_2)$, where $\tau_1$ and $\tau_2$ are estimated by employing $Q = (\bar x_1 - \bar x_2)'S^{-1}(\bar x_1 - \bar x_2)$ as an estimate for $\alpha^2$ when the parameters are unknown. For a Bayesian estimate of $\gamma(\theta)$ which employs $E_\theta\gamma(\theta) = \bar\gamma$, see Geisser (1967, 1970). It must be noted that this is an estimate of the best that can be done in terms of $\gamma(\theta)$ and not an estimate of what may be achieved with a given $V^*$ when the parameters are unknown. When one actually uses $V^*$ then we have the conditional or Actual PCC (APCC)

$$\mathrm{APCC} = \delta(\hat\theta, \theta) = q_1\delta_1(\hat\theta, \theta) + q_2\delta_2(\hat\theta, \theta)$$

where

$$\delta_1(\hat\theta, \theta) = \Pr(V^* > 0 \mid \pi_1, \hat\theta, \theta) = 1 - \Phi(\lambda_1), \qquad \delta_2(\hat\theta, \theta) = \Pr(V^* < 0 \mid \pi_2, \hat\theta, \theta) = \Phi(\lambda_2)$$

where

$$\lambda_i = \frac{(\tfrac12(\bar x_1 + \bar x_2) - \mu_i)'S^{-1}(\bar x_1 - \bar x_2) + \log q_2q_1^{-1}}{[(\bar x_1 - \bar x_2)'S^{-1}\Sigma S^{-1}(\bar x_1 - \bar x_2)]^{1/2}}, \quad i = 1, 2. \tag{4.9}$$

A naive estimate $\hat\delta(\hat\theta, \theta) = \delta(\hat\theta, \hat\theta)$ turns out to be $\gamma(\hat\theta)$, so that the estimate for the APCC is the same as for the PCC. This is most unsatisfactory and has led to much effort by frequentists in attempting to correct that estimate of the APCC or its peculiar companion $E[\mathrm{APCC}] = E_{\hat\theta}(\delta(\hat\theta, \theta))$, Hills (1966). In fact for a long time the various possible probabilities of correct classification were confused and the subject was in somewhat of a chaotic state until Hills (1966) presented a careful analysis of the various frequentist allocation error rates. For some further remarks see Geisser (1969, 1970). From a Bayesian point of view the problem as such completely disappears by using as estimator for $\delta_i(\hat\theta, \theta)$ its posterior expectation $\bar\delta_i = E_\theta(\delta_i(\hat\theta, \theta))$, which yields, Geisser (1967),


$$\bar\delta_1 = \Pr\left[t_{N_1+N_2-1-p} > \frac{(\log q_2q_1^{-1} - \tfrac12 Q)\,[N_1(N_1+N_2-1-p)]^{1/2}}{[(N_1+N_2-2)(N_1+1)Q]^{1/2}}\right],$$

$$\bar\delta_2 = \Pr\left[t_{N_1+N_2-1-p} < \frac{(\log q_2q_1^{-1} + \tfrac12 Q)\,[N_2(N_1+N_2-1-p)]^{1/2}}{[(N_1+N_2-2)(N_2+1)Q]^{1/2}}\right] \tag{4.10}$$

where $t_{N_1+N_2-1-p}$ is the Student "t" random variable with $N_1 + N_2 - 1 - p$ degrees of freedom. Clearly then $\bar\delta < \bar\gamma$ since $\delta(\hat\theta, \theta) \le \gamma(\theta)$, as required. Actually the Bayesian estimate $\bar\gamma$ of $\gamma(\theta)$ is rather difficult to compute explicitly, but $\bar\gamma \approx \gamma(\hat\theta)$; for a better approximation see Geisser (1970). Note also that $\bar\delta < \gamma(\hat\theta)$. At any rate a clear interpretation emerges: $\bar\gamma$ or $\gamma(\hat\theta)$ is an estimate of what potentially could be achieved with sample sizes very much larger than those in hand, whilst $\bar\delta$ is what is actually achievable with the data in hand. In other words if, say, $\gamma(\hat\theta)$ is large enough, then the discriminatory variables are satisfactory. However the user may not be satisfied with an appreciably lower value of $\bar\delta$. But then the remedy is clear: one needs larger sample sizes until $\bar\delta$ is close enough to $\gamma(\hat\theta)$ to be satisfactory. If $\gamma(\hat\theta)$ is not large enough to suit the purposes of the allocation then one must find other discriminatory variables.
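The contrast between the potential and the actually achievable probability of correct classification can be sketched numerically. The code below is only an illustration under stated assumptions: the plug-in value is taken as $q_1(1-\Phi(\hat\tau_1)) + q_2\Phi(\hat\tau_2)$ with $Q$ in place of $\alpha^2$, and the predictive value uses a Student $t$ with $N_1+N_2-1-p$ degrees of freedom and cut points $(\log q_2q_1^{-1} \mp Q/2)\,[N_i(N_1+N_2-1-p)]^{1/2}/[(N_1+N_2-2)(N_i+1)Q]^{1/2}$. The $t$ distribution function is obtained by simple numerical integration; all function names are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def student_t_cdf(x, df, steps=4000):
    """Student t distribution function, by Simpson's rule on the density."""
    log_const = (math.lgamma(0.5 * (df + 1)) - math.lgamma(0.5 * df)
                 - 0.5 * math.log(df * math.pi))

    def density(t):
        return math.exp(log_const - 0.5 * (df + 1) * math.log1p(t * t / df))

    upper = abs(x)
    h = upper / steps  # steps must be even for Simpson's rule
    total = density(0.0) + density(upper)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * density(i * h)
    area = total * h / 3.0  # integral of the density over [0, |x|]
    return 0.5 + area if x >= 0 else 0.5 - area


def plug_in_pcc(q1, big_q):
    """Plug-in estimate of the PCC, with Q substituted for alpha^2."""
    q2 = 1.0 - q1
    root = math.sqrt(big_q)
    log_ratio = math.log(q2 / q1)
    tau1 = (log_ratio - 0.5 * big_q) / root
    tau2 = (log_ratio + 0.5 * big_q) / root
    return q1 * (1.0 - normal_cdf(tau1)) + q2 * normal_cdf(tau2)


def predictive_pcc(q1, n1, n2, p, big_q):
    """Predictive probability of correct classification (assumed t form)."""
    q2 = 1.0 - q1
    df = n1 + n2 - 1 - p
    nu = n1 + n2 - 2
    log_ratio = math.log(q2 / q1)
    cut1 = (log_ratio - 0.5 * big_q) * math.sqrt(n1 * df / (nu * (n1 + 1) * big_q))
    cut2 = (log_ratio + 0.5 * big_q) * math.sqrt(n2 * df / (nu * (n2 + 1) * big_q))
    delta1 = 1.0 - student_t_cdf(cut1, df)
    delta2 = student_t_cdf(cut2, df)
    return q1 * delta1 + q2 * delta2
```

With $Q = 4$, $p = 2$ and equal priors, ten observations per population give a predictive value visibly below the plug-in value, while very large samples bring the two together, in line with the interpretation given above.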

While $\bar\delta = q_1\bar\delta_1 + q_2\bar\delta_2$ is an estimate of the APCC, it turns out that $\bar\delta$ is exactly the predictive probability of correct classification (EPCC) using $V^*$, though $V^*$ is not optimal unless $q_1 = q_2$ and $N_1 = N_2$. Even when these conditions do not hold it should not be too far from optimal, as it approaches optimality for large $N_1$ and $N_2$. Further, $\bar\delta$ is also a useful guide in determining which variables may be omitted in measuring future observations. For example it can happen for economic or other reasons that only a subset of $r$ of the $p$ original variables can be utilized for allocating future observations. Then one could compute $\bar\delta$, weighted by an appropriate cost or utility factor, for each subset of $r$ out of the $p$ variables in order to make an optimal determination.

At any rate this approach yields sensible answers when one uses the usual linear discriminant for allocation. Slight improvements can be made by some adjustment of $V$ within the Bayesian framework as noted by Geisser (1967) and Enis and Geisser (1974), but it is not likely the effect will be significant. Extension to $r > 2$ populations throughout, or to $q_i$ unknown, presents no intrinsic difficulty, see Geisser (1964, 1967).
5.  Maximizing  Measures  of  Spread  for  Linear  Discriminatory  Forms

When there are no appropriate distributional assumptions one can proceed by both choosing a class $\mathcal{L}(u)$ of discriminatory functions, linear, quadratic, etc., and defining either a distance between any two populations, Fisher [1938], or a more general measure of spread amongst all of the populations. A minimal set of "best" discriminants then presumably would be selected from all of those solutions that maximize the spread with respect to the parameters of the discriminatory functions given the constraints under consideration. These discriminants then can be used to completely characterize the differential aspects of the populations with respect to the manifest variables.

Let us further assume all of the distributions of the $r$ populations are roughly the same in that they enjoy approximately the type of clustering and symmetry about their mean vectors exhibited by a set of multivariate normal densities with equal covariance matrices. Basically then the important differences are in the location of these central vectors. Fisher (1936) then found it sensible, for $r$ populations, to find the set of linear combinations $c'u$ which maximized pairwise the distance functions $[c'(\mu_i - \mu_j)]^2/c'\Sigma c$, $i, j = 1, \ldots, r$, where $\Sigma$ was assumed to be the common covariance matrix. This generates the optimal reduced set of linear discriminants previously obtained where multivariate normal theory was assumed. The technique used by Fisher was essentially differentiation with Lagrange multipliers. An alternate geometric derivation is given by Dempster [1969].
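For two populations the maximization of $[c'(\mu_1 - \mu_2)]^2/c'\Sigma c$ has the familiar solution $c \propto \Sigma^{-1}(\mu_1 - \mu_2)$, with maximum value $(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)$. The sketch below checks this in two dimensions with a hand-coded $2 \times 2$ inverse; the function names and the numerical values used with them are illustrative assumptions only.

```python
def fisher_direction(mu1, mu2, sigma):
    """Return c = Sigma^{-1} (mu1 - mu2) for a 2 x 2 covariance matrix."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    dx, dy = mu1[0] - mu2[0], mu1[1] - mu2[1]
    # Explicit inverse of a 2 x 2 matrix applied to the mean difference.
    return ((d * dx - b * dy) / det, (-c * dx + a * dy) / det)


def separation(cvec, mu1, mu2, sigma):
    """Distance function [c'(mu1 - mu2)]^2 / (c' Sigma c)."""
    dx, dy = mu1[0] - mu2[0], mu1[1] - mu2[1]
    numerator = (cvec[0] * dx + cvec[1] * dy) ** 2
    denominator = (cvec[0] * (sigma[0][0] * cvec[0] + sigma[0][1] * cvec[1])
                   + cvec[1] * (sigma[1][0] * cvec[0] + sigma[1][1] * cvec[1]))
    return numerator / denominator
```

For instance, with means $(1, 0)$ and $(0, 0)$ and unit variances with covariance $0.5$, the maximizing direction attains the value $4/3$, whereas the first coordinate alone attains only $1$.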

There are other methods of obtaining these linear discriminants, which involve maximizing some measure of spread, Wilks (1962). The technique used for the maximization by Wilks also involved Lagrange multipliers. We now
present  an  alternate  derivation,  Geisser  (1973^)  which  is  completely  algebraic 
and somewhat more general. Again, suppose there are $r$ $p$-dimensional multivariate populations with means $\mu_1, \ldots, \mu_r$ and common positive definite covariance matrix $\Sigma$. Further let $\beta$, of rank $r - v \le p$, be defined as previously in Section 2. Assume that $g(\beta\Sigma^{-1}) = g(\delta_1, \ldots, \delta_{r-v})$ is any scalar measure of the spread of these $r$ populations that is increasing in the non-zero roots of $\beta\Sigma^{-1}$, $\delta_1 \ge \cdots \ge \delta_{r-v} > 0$. Suppose further we transform these $r$ $p$-dimensional populations into a $k \le p$ space by a real $k \times p$ transformation matrix $C$ which is of rank $k$. Hence $\eta_i = C\mu_i$, $i = 1, \ldots, r$, $\Omega = C\Sigma C'$, $\Gamma = C\beta C'$ and the measure of spread in $k$ dimensions is $g_k(\Omega^{-1}\Gamma)$, i.e., the same scalar function of the non-zero roots $d_1 \ge \cdots \ge d_t > 0$ of $\Omega^{-1}\Gamma$, where $t = \min(k, r - v)$. Then we shall show that

$$\max_C\, g_k(\Omega^{-1}\Gamma) = g(\delta_1, \ldots, \delta_t). \tag{5.1}$$

As the maximum spread is attained for $k = r - v$, there is no interest in the discriminatory situation in considering $k > r - v$. An orthogonal basis solution for $C$, when $k \le r - v$, would then be

$$C = P_{(k)}'A'\Sigma^{-1} \tag{5.2}$$

where $P_{(k)}$ is as previously defined. Consequently $AP_j$ is the characteristic vector associated with the $j$th largest root of $\beta\Sigma^{-1}$. Hence the conjecture made previously in Section 2 has a basis in fact if optimization depends on maximizing every scalar measure of spread which is an increasing function of the non-zero roots of $\beta\Sigma^{-1}$. One can also define the fraction of total loss sustained in the measure of spread when $k < r - v$ as

$$\frac{g(\delta_1, \ldots, \delta_{r-v}) - g(\delta_1, \ldots, \delta_k)}{g(\delta_1, \ldots, \delta_{r-v})}. \tag{5.3}$$

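For example, if we are using either the "Hotelling" or "Wilks" measure of spread:

$$\mathrm{tr}\,\beta\Sigma^{-1} = \sum_{i=1}^{r-v}\delta_i, \qquad |I + \beta\Sigma^{-1}| = \prod_{i=1}^{r-v}(1 + \delta_i) \tag{5.4}$$

then the respective fractions of loss are

$$\sum_{i=k+1}^{r-v}\delta_i \Big/ \sum_{i=1}^{r-v}\delta_i, \qquad 1 - \prod_{i=k+1}^{r-v}(1 + \delta_i)^{-1}. \tag{5.5}$$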
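The fraction of loss in the measure of spread is easy to tabulate once the roots are available. The sketch below assumes the roots $\delta_i$ have already been computed and uses the sum of the roots and the product of the $(1 + \delta_i)$ as two example measures; the function names are hypothetical.

```python
import math


def hotelling(roots):
    """Trace-type measure of spread: the sum of the roots."""
    return sum(roots)


def wilks(roots):
    """Determinant-type measure of spread: the product of (1 + root)."""
    return math.prod(1.0 + d for d in roots)


def loss_fraction(roots, k, measure):
    """Fraction of spread lost when only the k largest roots are retained."""
    ordered = sorted(roots, reverse=True)
    full = measure(ordered)
    reduced = measure(ordered[:k])
    return (full - reduced) / full
```

With roots 3, 1 and 0.5 and a single retained discriminant, the trace-type measure loses one third of the spread and the determinant-type measure loses two thirds, which shows how strongly the assessment depends on the measure chosen.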

The  algebraic  derivation  of  the  aforementioned  results  is  basically  an 
application  of  the  following  matrix  theorem. 

Theorem: Let $Z$ be a real $p \times m$ matrix of rank $s = \min(p, m)$ and $E_k$ be the class of $p \times p$ real symmetric idempotent matrices of rank $k$. Then for all $F \in E_k$ the maximum attainable values of the first $t$ ordered roots $a_i$ of $Z'FZ$ are $\alpha_i$, $i = 1, \ldots, t$, $t = \min(k, m)$, where the $\alpha_i$'s are the non-zero ordered roots of $Z'Z$. Further, the totality of solutions for $F$ where the maximum values of the roots are attained is given by

$$F_0 = \begin{cases} Y_{(k)}D_k^{-1}Y_{(k)}' & \text{for } k \le m \\ Z(Z'Z)^{-1}Z' + G_{k-m} & \text{for } k \ge m \end{cases} \tag{5.6}$$

where $Y_{(k)} = (Y_1, \ldots, Y_k)$ represents the first $k$ columns of $Y = ZP$, and $P$ is the orthogonal matrix such that $P'Z'ZP = D_m$ where

...

and $F_s^*$ is the $s \times s$ matrix in the upper left hand corner of $F^*$. Now the maximum rank of $Y'FY$ is $t = \min(k, m)$ or $t = \min(k, s)$. Hence all solutions for which $a_i = \alpha_i$, $i = 1, \ldots, t$, are such that the rank of $Y'FY$ must be $t$. Further the first $t$ diagonal elements of $F_s^*$ are then 1 since $\sum_i a_i = \sum_i \alpha_i f_{ii}^*$, $0 \le f_{ii}^* \le 1$ and $\sum_i f_{ii}^* = k$ must be satisfied. This implies that the off diagonal elements in those rows and columns are zero since we are dealing with idempotent matrices. Therefore, all solutions for $F^*$ are

$$F_0^* = \begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix} \ \text{for } t = k, \text{ i.e., } k \le m, \qquad \text{or} \qquad F_0^* = \begin{pmatrix} I_m & 0 \\ 0 & G \end{pmatrix} \ \text{for } t = m, \text{ i.e., } m \le k, \tag{5.10}$$

where $G$ is any idempotent $(p-m) \times (p-m)$ matrix of rank $k - m$. Hence the totality of solutions for $F$ are $F_0 = QF_0^*Q'$, so that

$$F_0 = Y_{(k)}D_k^{-1}Y_{(k)}' \quad \text{for } k \le m, \tag{5.11}$$

which is unique if the $\alpha_i$ are distinct, and

$$F_0 = YD_m^{-1}Y' + Q_2GQ_2' \quad \text{for } m \le k. \tag{5.12}$$

Note that from (5.12) and $ZP = Y$ that

$$F_0 = ZPD_m^{-1}P'Z' + Q_2GQ_2' = Z(Z'Z)^{-1}Z' + Q_2GQ_2' \quad \text{for } m \le k. \tag{5.13}$$

Hence set $Q_2GQ_2' = G_{k-m}$. Since $G$ is an arbitrary idempotent matrix of rank $k - m$ and $Q_2$ is orthogonal to $Z$, being orthogonal to $Y$, then $G_{k-m}$ is an arbitrary idempotent matrix orthogonal to $Z$ and the theorem is established.
established. 

As an immediate consequence of the theorem and the fact that $a_i \le \alpha_i$ we have the following:

Corollary

If $g(Z'FZ) = g(a_1, \ldots, a_t)$ is a scalar non-decreasing function of the roots $a_i$, then

$$\max_{F \in E_k} g(a_1, \ldots, a_t) = g(\alpha_1, \ldots, \alpha_t). \tag{5.14}$$

In order to apply the theorem and corollary we first note that the non-zero roots of $\Gamma\Omega^{-1} = C\beta C'(C\Sigma C')^{-1}$ are the same as the non-zero roots of $A'C'(C\Sigma C')^{-1}CA$, where $\beta = AA'$ and $A$ is $p \times (r - v)$. Set $C\Sigma^{1/2} = H$, where $\Sigma^{1/2}$ is the positive definite symmetric square root of $\Sigma$, so that the non-zero roots of $\Gamma\Omega^{-1}$ are the same as the non-zero roots of $A'\Sigma^{-1/2}H'(HH')^{-1}H\Sigma^{-1/2}A$. Set $r - v = m$, $Z = \Sigma^{-1/2}A$ and the idempotent matrix $H'(HH')^{-1}H = F$. Hence by our previous corollary

$$\max_C\, g_k(\Gamma\Omega^{-1}) = \max_F\, g_k(Z'FZ) = g(\delta_1, \ldots, \delta_t).$$

To find solutions for $C$ we note that there is an orthogonal matrix $P$ such that

$$P'A'\Sigma^{-1}AP = \Delta_m = \mathrm{diag}(\delta_1, \ldots, \delta_m)$$

and in the theorem set $Y = \Sigma^{-1/2}AP$ and $Y_{(j)} = \Sigma^{-1/2}AP_{(j)}$, where $P_{(j)} = (P_1, \ldots, P_j)$ is the matrix consisting of the first $j$ columns of $P$. Note also that $AP_i$ is the invariant vector of $\beta\Sigma^{-1}$ corresponding to the root $\delta_i$, $i = 1, \ldots, m$, and the calculation of $Y$ does not depend on $A$. Hence from the theorem

$$F_0 = Y_{(k)}\Delta_k^{-1}Y_{(k)}' \quad \text{for } k \le r - v. \tag{5.15}$$

From $H'(HH')^{-1}H = F$ we obtain $H = HF$, and noting from (5.15) that $Y_{(k)}'F_0 = Y_{(k)}'$, then $H_0 = Y_{(k)}'$ and $C_0 = H_0\Sigma^{-1/2} = P_{(k)}'A'\Sigma^{-1}$ for $k \le r - v$, as required.

The derivations in this section and in Section 2 have been presented in terms of population parameters. But obviously if sample estimates based on data at hand, $\hat\beta$ and $\hat\Sigma$, are utilized there need not be, from one point of view, any essential change other than that "optimization" now takes place with regard to sample estimates. The problem then is to decide on the "plug-in" estimators. The substitution of unbiased sample moments for the population moments results in the same set of sample discriminants as is used in the normal case where maximum likelihood estimates corrected for bias are utilized. Of course our guide was the simplifying assumption that the multivariate populations differ mainly in their locations and that, relative to this, variations in the dispersion matrices are unimportant, as exemplified explicitly in the previously discussed multivariate normal model. This latter model gave rise to linear theory in terms of population discriminants and consequently it is no surprise that focusing on linear theory can yield the same results.
6.  Sample  Reuse  Techniques  -  Allocatory  and  Separatory


When precise distributional assumptions are untenable or abandoned entirely so that theoretical calculations are precluded, one is then obliged to provide some other means for rendering discriminants and assessing their quality. In this section we shall discuss how sample reuse notions can be directed towards an evaluation of discriminants and their further refinement. Again for simplicity let us focus on the two population case. Let the usual single linear discriminant be $V = V(u)$. By substituting each of the $p$-dimensional observations $x_{ij}$, for $j = 1, \ldots, N_i$ and $i = 1, 2$, in $V$ we obtain $V(x_{ij}) = v_{ij}$, or two sets of univariate observations. These may now be plotted on a single axis distinguishing them only by their population origin. If there are a great many of them a histogram for each set is visually informative in indicating the quality of the linear separation induced. If $V^*$ is to be used for allocatory purposes then some assessment of the APCC is in order. It has long been clear that the naive assessment of the APCC, by merely calculating $q_1n_1N_1^{-1} + q_2n_2N_2^{-1}$, where $n_i$ represents the number of $x_{ij}$'s that are correctly classified by $V^*$, will be too large, just as in the normal case $\gamma(\hat\theta)$ was generally too large as discussed in section 4. A sample reuse technique for correcting this flaw was proposed by Lachenbruch (1965). He proposed calculating $V^*_{ij}(u)$, the linear discriminant with $x_{ij}$ omitted, and then computing $V^*_{ij}(x_{ij}) = u_{ij}$ and classifying $x_{ij}$ on the basis of whether $u_{ij}$ exceeded or fell short of 0. An adjusted estimate of the APCC, $q_1n_1'N_1^{-1} + q_2n_2'N_2^{-1}$, is obtained, where $n_i'$ represents the number of $x_{ij}$'s, $i = 1, 2$, correctly classified. Note that if $N_i/(N_1 + N_2)$ is appropriate as an estimator of $q_i$, when it is unknown, then the estimator for the APCC becomes $(n_1' + n_2')/(N_1 + N_2)$. This is reasonable in situations where the initial sample of size $N = N_1 + N_2$ is drawn at random from $\pi = \pi_1 \cup \pi_2$ so that the random frequency $N_i$ provides information on $q_i$.
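The difference between the naive assessment and the leave-one-out assessment can be seen in a very small example. The sketch below is an illustration only: it treats the univariate case, where the plug-in discriminant is taken as $(u - \tfrac12(\bar x_1 + \bar x_2))(\bar x_1 - \bar x_2)/s^2 + \log q_1q_2^{-1}$ with pooled variance $s^2$, and all function names and data values are hypothetical.

```python
import math


def linear_score(u, g1, g2, q1):
    """Univariate plug-in discriminant built from the samples g1 and g2."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss = sum((x - m1) ** 2 for x in g1) + sum((x - m2) ** 2 for x in g2)
    s2 = ss / (n1 + n2 - 2)  # pooled variance estimate
    return (u - 0.5 * (m1 + m2)) * (m1 - m2) / s2 + math.log(q1 / (1.0 - q1))


def resubstitution_estimate(g1, g2, q1):
    """Naive estimate: each observation is scored by the full-sample rule."""
    ok1 = sum(1 for x in g1 if linear_score(x, g1, g2, q1) > 0)
    ok2 = sum(1 for x in g2 if linear_score(x, g1, g2, q1) <= 0)
    return q1 * ok1 / len(g1) + (1.0 - q1) * ok2 / len(g2)


def leave_one_out_estimate(g1, g2, q1):
    """Adjusted estimate: each observation is scored with itself omitted."""
    ok1 = sum(1 for j, x in enumerate(g1)
              if linear_score(x, g1[:j] + g1[j + 1:], g2, q1) > 0)
    ok2 = sum(1 for j, x in enumerate(g2)
              if linear_score(x, g1, g2[:j] + g2[j + 1:], q1) <= 0)
    return q1 * ok1 / len(g1) + (1.0 - q1) * ok2 / len(g2)
```

For the samples `[3.0, 4.0, 5.0, 0.2]` and `[-3.0, -4.0, -5.0, -0.2]` with equal priors, resubstitution classifies every observation correctly, while omitting each borderline observation in turn moves the cutoff enough to misclassify it, so the adjusted estimate drops to 0.75.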

This method may also be of value in the determination of which variables could be eliminated or which subset of $r$ out of the $p$ would be optimal in future measurements as discussed in section 4.

One could attempt a finer tuning, as it were, by applying the predictive sample reuse (PSR) method, Geisser (1975). A criterion applicable here would be to maximize

$$P(u_0) = q_1n_1(u_0)N_1^{-1} + q_2n_2(u_0)N_2^{-1} \tag{6.1}$$

with respect to $u_0$, where $n_i(u_0)$ represents the number of $x_{ij}$'s correctly allocated by $V^*_{ij}$ such that $x_{ij}$ is allocated to $\pi_1$ if $V^*_{ij}(x_{ij}) = u_{ij} > u_0$ and to $\pi_2$ otherwise. One would order the scalar values $u_{ij}$ and find that cutoff point $\hat u_0$ which maximizes $P(u_0)$. This can easily be done numerically as it is essentially a counting procedure. Convenient algorithms can be found to shorten the process. While it is also clear that $\hat u_0$ need not be unique, it can be made so by arbitrarily selecting a particular one of them, e.g. the maximizer closest to zero. Then for future allocation one uses $V^*(u) \gtrless \hat u_0$ as the allocatory discriminant. One could also alter the criterion when $q_i$ is unknown and maximize $\hat P(u_0) = \hat q_1n_1(u_0)N_1^{-1} + \hat q_2n_2(u_0)N_2^{-1}$. If $N_i(N_1 + N_2)^{-1}$ can be used as an estimator for $q_i$ then the new criterion effectively maximizes the total number of $x_{ij}$'s correctly classified by $\hat V^*_{ij} \gtrless u_0$, and again one would use $\hat V^*(u) \gtrless \hat u_0$ as the allocator, where $\hat V^*_{ij}$ and $\hat V^*(u)$ are merely $V^*_{ij}$ and $V^*(u)$ respectively with $q_1q_2^{-1}$ replaced by $N_1N_2^{-1}$.

An appropriate assessment of the new discriminant would require a two-deep cross-validatory assessment, i.e. of the Lachenbruch (1965) type. There are obviously ways of applying the PSR approach to linear discriminants other than the cut-off point allocatory approach illustrated here, e.g. estimating by PSR the linear regression coefficients.
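The counting procedure for the cutoff point can be sketched as follows. This is a minimal illustration, assuming the leave-one-out scores $u_{ij}$ have already been computed for both samples; since the criterion only changes at observed scores, the search is restricted to those scores, zero, and one point below the smallest score, and ties are broken in favour of the candidate closest to zero. The function name and data are hypothetical.

```python
def best_cutoff(scores1, scores2, q1):
    """Return (u0_hat, P(u0_hat)) maximizing q1*n1(u0)/N1 + q2*n2(u0)/N2."""
    q2 = 1.0 - q1
    n1, n2 = len(scores1), len(scores2)

    def criterion(u0):
        ok1 = sum(1 for u in scores1 if u > u0)    # allocated to pi_1
        ok2 = sum(1 for u in scores2 if u <= u0)   # allocated to pi_2
        return q1 * ok1 / n1 + q2 * ok2 / n2

    all_scores = list(scores1) + list(scores2)
    # Candidate cutoffs: every distinct partition of the scores is reachable.
    candidates = sorted(set(all_scores) | {0.0, min(all_scores) - 1.0})
    best = max(criterion(u0) for u0 in candidates)
    maximizers = [u0 for u0 in candidates if criterion(u0) >= best - 1e-12]
    u0_hat = min(maximizers, key=abs)  # arbitrary rule: closest to zero
    return u0_hat, best
```

For example, `best_cutoff([2.0, 1.0, 0.5, -0.3], [-2.0, -1.0, 0.2, -0.6], 0.5)` finds two maximizing candidates, -0.6 and 0.2, each classifying seven of the eight scores correctly, and selects 0.2 as the one closer to zero.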

If one is concerned mainly with the reliability of a discriminant in its separatory role, then Mosteller and Tukey (1968) and Lachenbruch and Mickey (1968) suggest jackknifing $V$. First one calculates the set of pseudo-discriminant functions $V'_{ij} = (N_1 + N_2)V - (N_1 + N_2 - 1)V_{ij}$, $j = 1, \ldots, N_i$, $i = 1, 2$, and then $V' = (N_1 + N_2)^{-1}\sum_{i,j}V'_{ij}$, which is termed the jackknifed discriminant. One can compute the reliability of $V'$ in terms of the variation of the regression coefficients of the $V'_{ij}$, the individual values averaged to compute $V'$. Examining the ratio of a regression coefficient in $V'$ to its sample standard error permits a judgement on the significance of its deviation from zero. The main point of this exercise is to assess to some degree the reliability of the jackknifed discriminant function in its separatory role. Again if one decides to use $V^*$ (or $V^{*\prime}$) as an allocator one can assess it by using a two-deep cross-validatory approach.
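The jackknife computation can be sketched for a single coefficient. The code below is an illustration under stated assumptions: pseudo-values are formed in the usual way as $N\hat\theta - (N-1)\hat\theta_{(ij)}$ with $N = N_1 + N_2$, the standard error is taken as the square root of the sample variance of the pseudo-values divided by $N$, and the univariate coefficient $(\bar x_1 - \bar x_2)/s^2$ stands in for a regression coefficient of the discriminant. All names and data are hypothetical.

```python
import math


def jackknife(g1, g2, statistic):
    """Return (jackknifed estimate, standard error) for statistic(g1, g2)."""
    n = len(g1) + len(g2)
    full = statistic(g1, g2)
    pseudo = []
    for j in range(len(g1)):   # omit each observation of the first sample
        pseudo.append(n * full - (n - 1) * statistic(g1[:j] + g1[j + 1:], g2))
    for j in range(len(g2)):   # omit each observation of the second sample
        pseudo.append(n * full - (n - 1) * statistic(g1, g2[:j] + g2[j + 1:]))
    mean = sum(pseudo) / n
    variance = sum((v - mean) ** 2 for v in pseudo) / (n - 1)
    return mean, math.sqrt(variance / n)


def discriminant_coefficient(g1, g2):
    """Univariate discriminant coefficient (xbar1 - xbar2) / s^2."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss = sum((x - m1) ** 2 for x in g1) + sum((x - m2) ** 2 for x in g2)
    return (m1 - m2) / (ss / (n1 + n2 - 2))
```

The ratio of the jackknifed coefficient to its standard error then plays the role of the significance check described above. For a statistic that is linear in the sample means, such as the difference of means, the jackknifed estimate coincides with the original one.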
7.  Remarks


I have attempted to carefully delineate two distinct purposes of discriminatory analyses and to examine the linear aspects involved. However linearity has been surveyed, as it were, from a personal (not to be confused with personalistic) point of view, in that much work
on  the  linear  aspects  of  normal  parametric  discrimination  has  not 
been  mentioned  chiefly  because  it  involves  the  generation  of  modest 
improvements  with  regard  to  certain  frequency  properties  by  some 
slight  alteration  of  the  linear  discriminants.  It  is  my  contention 
that here the Bayesian approach, or when adjustments are indicated, a Bayesian type of adjustment will yield better results than frequential tinkering. When parametric assumptions are fuzzy or nonexistent, sample reuse methods, which are frequency oriented predictive simulation techniques, should serve.

Finally  it  must  be  borne  in  mind  that  Discrimination  is  a technique 
which  is  often  most  useful  in  the  early  history  or  soft  stage  of  a 
discipline  when  notions  are  fuzzy,  measurements  crude  or  indirect  and 
relationships  vaguely  understood  at  best.  Hence  it  is  generally  an 
appreciable  improvement  of  whatever  has  gone  before--theoretical 
niceties notwithstanding. No doubt Linear Discrimination fulfills the role played by Barnard's (1972) "midwife" in fostering the parturition of pertinent distinctions, probabilistic or classificatory, during the birthpangs of a scientific discipline, but it is soon abandoned or its focus shifted as the discipline hardens.
ABSTRACT


This  paper  delineates  two  distinct  purposes  of  discriminatory  analyses — 
allocation  and  separation — and  examines  the  linear  aspects  involved. 

After  a review  of  multivariate  normality  and  a discussion  of  the  extent 
to  which  linearity  is  optimal,  suggestions  are  made  as  to  the  actual  use 
of  linear  discriminants.  Then,  dropping  distributional  assumptions,  we 
focus  on  the  separation  of  populations  via  linear  functions.  We  also 
discuss  the  application  of  sample  reuse  procedures  to  linear  discriminants. 